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Abstract 

Background: Large-scale annotation efforts have improved our ability to coarsely predict regulatory elements 
throughout vertebrate genomes. However, it is unclear how complex spatiotemporal patterns of gene expression 
driven by these elements emerge from the activity of short, transcription factor binding sequences. 

Results: We describe a comprehensive promoter extension assay in which the regulatory potential of all 6 base- 
pair (bp) sequences was tested in the context of a minimal promoter. To enable this large-scale screen, we 
developed algorithms that use a reverse-complement aware decomposition of the de Bruijn graph to design a 
library of DMA oligomers incorporating every 6-bp sequence exactly once. Our library multiplexes all 4,096 unique 
6-mers into 184 double-stranded 15-bp oligomers, which is sufficiently compact for in vivo testing. We injected 
each multiplexed construct into zebrafish embryos and scored GFP expression in 15 tissues at two developmental 
time points. Twenty-seven constructs produced consistent expression patterns, with the majority doing so in only 
one tissue. Functional sequences are enriched near biologically relevant genes, match motifs for developmental 
transcription factors, and are required for enhancer activity. By concatenating tissue-specific functional sequences, 
we generated completely synthetic enhancers for the notochord, epidermis, spinal cord, forebrain and otic lateral 
line, and show that short regulatory sequences do not always function modularly. 

Conclusions: This work introduces a unique in vivo catalog of short, functional regulatory sequences and 
demonstrates several important principles of regulatory element organization. Furthermore, we provide resources 
for designing compact, reverse-complement aware /c-mer libraries. 



Background 

The activity of regulatory elements is governed by tran- 
scription factors (TFs), which bind to DNA sequences 
averaging 6 to 9 bp in size [1] and function in combina- 
tion to achieve diverse gene expression patterns. Knowl- 
edge of the vertebrate TF repertoire and individual TP 
binding site (TFBS) preferences is still far from com- 
plete, due to binding-site degeneracy and the challenge 
of annotating functional TFBSs that may occur over a 
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million base pairs from the transcription start site (TSS) 
of their target gene. Even less is known about the mini- 
mal functional elements that are sufficient to drive tis- 
sue-specific expression, and how elaborate patterns 
result from their interplay. Deciphering the vocabulary 
and grammar of the vertebrate regulatory code will 
enable high-resolution mapping of regulatory elements, 
accurate interpretation of nucleotide variation in regula- 
tory sequences, and the design of regulatory elements to 
deliver genes to specific tissues for research and thera- 
peutic purposes. 

Many top-down approaches are used to predict verte- 
brate regulatory sequences. For a subset of vertebrate TFs, 
motifs derived from in vitro measurements of binding affi- 
nity [2,3] can be used to identify potential TFBSs in geno- 
mic sequences. Distinguishing which TFBSs are functional 
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is challenging, even with the aid of evolutionary conserva- 
tion [4,5] and clustering of TFBSs [6,7]. Techniques that 
combine biochemical assays with massively parallel 
sequencing, such as ChlP-seq, DNase-seq and formalde- 
hyde-assisted isolation of regulatory elements (FAIRE)- 
seq, identify potential gene-regulatory elements by mea- 
suring TF binding and open chromatin in a specific cell 
type on a genome-wide scale [8-17]. However, these meth- 
ods only coarsely locate functional sequences and are pre- 
dictive in nature, requiring downstream validation. There 
is evidence that regulatory signals may be short (approxi- 
mately 44 bp) [18], but canonical regulatory elements 
identified by these techniques tend to be quite long. For 
example, the average size of the 846 functional elements 
currently listed in the VISTA enhancer browser is 1,812 
bp [19]. Traditional efforts to bash larger elements for 
core domains [18] are effective, but are time-consuming 
and costly. 

To complement existing methods for identifying regu- 
latory regions, it will be essential to reverse-engineer reg- 
ulatory elements to reveal the fundamental units that 
drive complex expression patterns and the rules that gov- 
ern their organization. One study that employed such an 
approach [20] m vitro tested 100-bp synthetic elements, 
each composed of tandem repeats of multiplexed 10-bp 
sequences, in six cell lines. Considerable differences in 
expression levels were observed between cell lines, con- 
sistent with the tissue-specificity of native vertebrate reg- 
ulatory elements [8]. However, in vivo assays are essential 
to understanding how the vertebrate regulatory logic 
functions in a physiological environment involving multi- 
ple cell types, cellular interactions and temporal stages. 

Here, we developed a bottom-up approach to charac- 
terize regulatory element vocabulary by systematically 
cataloging the effect of each 6-bp sequence (6-mer) on a 
TATA-box-containing 42-bp minimal promoter using 
zebrafish transgenics. To make this comprehensive 
screen feasible in vivo, we multiplexed the 6-mers into an 
ultra-compact library of 184 double-stranded constructs, 
each 15-bp in length. The activity of each construct was 
then scored in 15 tissues at 24 and 48 hours post-fertili- 
zation (hpf). We identified 27 multiplexed oligomers 
(15%) that produced distinct, consistent expression pat- 
terns. Four such constructs with distinct expression pro- 
files were functionally dissected to identify even shorter 
9-bp sequences that match binding motifs of known TF 
families, suggesting a mechanism by which they achieve 
tissue-specificity. Removal of these functional sequences 
from native zebrafish enhancers reduced their activity. 
Functional sequences were also used to design larger 
completely synthetic homotypic regulatory modules that 
drove strong expression to specific tissues. In contrast, 
heterotypic combinations of synthetic concatemers 
revealed a non-modular grammar. 



Results 

Design of an ultra-compact library of multiplexed 6-mers 

The number n/^ of unique sequences of length k (/c-mers) is 
4^ and therefore, values in the TFBS-length range of 6 to 
9 bp produce an infeasible number of sequences for indivi- 
dual testing. On the other hand, values of k less than 6 
would not be expected to cover a substantial portion of 
the TFBS binding spectrum. A solution to this problem is 
to multiplex /c-mers into longer oligomers. However, the 
length of the oligomers is inversely correlated with the 
information that can be learned about individual /c-mers 
and proportional to the follow-up testing (for example, 
bashing) required for identifying the active /c-mers. Addi- 
tionally, inefficient multiplexing (Figure la), exemplified 
by concatenating /c-mers or randomly generating oligo- 
mers, results in large libraries with biased /c-mer coverage. 
By developing mathematical and computational techni- 
ques for efficiently multiplexing /c-mers into oligomers 
(Additional file 1), we balanced these criteria at /c = 6 {n/^ = 
4,096) and an oligomer length of 15 bp. 

Compact /c-mer libraries have been used in several appli- 
cations, such as protein-binding microarrays [3], universal 
DNA tag systems [21], and microfluidic affinity analysis 
devices [22]. This type of library is constructed from divid- 
ing up a de Bruijn sequence, which contains one copy of 
every /c-mer. This sequence can be generated from a de 
Bruijn graph (Additional file 2), a mathematical object that 
has been applied in several biological contexts, particularly 
read assembly [23]. For k = 6 and a length of 15 bp, such a 
library would contain approximately 400 oligomers. 

The theory underlying de Bruijn sequences was devel- 
oped for general alphabets, not DNA. With the excep- 
tion of self-reverse-complementary palindromes, each k- 
mer has a distinct reverse complement that is presum- 
ably equivalent to a TF binding double-stranded DNA. 
By restricting each reverse-complementary pair to a sin- 
gle representative on the forward strand, we reduced the 
library size by approximately 50%. Our design, which we 
call a minimal reverse-complementary-covering (MRCC) 
library of order k and size m, is a collection of m DNA 
sequences in which a single representative for each 
reverse-complementary pair of /c-mers appears exactly 
once. To create an MRCC library, we designed an algo- 
rithm (Additional file 1) that computes a new type of 
palindrome-aware edge decomposition of the de Bruijn 
graph (Figure la) and enables us to account for reverse 
complementarity. In contrast to a de Bruijn sequence, 
which is obtained by computing a single cyclical path 
that traverses every edge in the graph exactly once 
(Additional file 2), an MRCC library is obtained by par- 
titioning the edges into many paths. We note that for 
odd values of /c, the algorithm is similar to a previous 
heuristic [24] that attempts to compute an analog to 
the de Bruijn sequence. However, this analog does 
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Figure 1 In vivo regulatory screen of all 6-mers. (a) Left top: naively concatenating three 6-nners creates an oligomer with multiple 
representatives of A^GCG (red bars). Center and left: cartoon of a de Bruijn graph. Nodes (colored boxes) represent 6-mers; edges (arrows) 
represent overlap. A standard de-Bruijn-sequence library is built from one path that traverses each of 4,096 nodes once. Constructed from multiple 
paths, our MRCC library instead uses one representative for each pair of reverse-complementary 6-mers (green and yellow; self-reverse- 
complementary palindromes in blue), making it nearly 50% smaller (Additional file 1). Right: 16 of 16,384 edges shown. Our algorithm removes 
reverse-complementary paths (black, red) between palindrome pairs and then decomposes the remaining graph into reverse-complementary 
cycles. It allowed us to design an ultra-compact library of DNA sequences with uniform 6-mer coverage, (b) Schematic depicting the sub-cloning of 
each 15-bp multiplexed oligomer into the Elb-Tol2 vector and subsequent injection into single-cell zebrafish embryos, (c) Violin plots depicting 
the distribution of the expression patterns of each tissue at 24 hpf. White lines indicate the fractional expression values for the empty vector 
construct, (d) Scatter plot depicting the method by which we selected consistently expressed multiplexed oligomers whose expression was not 
significantly correlated to minimal-promoter bias. The vertical dotted line denotes the 40% fractional expression threshold that was used, whereas 
the horizontal dotted line corresponds to a false discovery rate-adjusted P-value of 0.05. (e) Histogram depicting the tissue specificity of the 22 
uncorrelated, consistently expressed constructs at 24 hpf. (f) Representative images at 24 hpf for embryos injected with 1EF03 (top) and 3EF09 
(bottom), exhibiting broad expression that was correlated with that of the empty vector expression. The full sequence of each construct (with Xhol 
and Bglll flanking sites) is listed below each figure. Both constructs have a 5' GC-rich region. GFP, green fluorescent protein. 
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not exist for even values of k larger than 2 (Additional 
file 1). 

Using custom software that we have made available [25], 
we created an MRCC library of 208 DNA oligomers, each 
15 bp, with uniform, single-copy coverage of all 6-mers. 
By exploiting 6-mers that overlap junctions between oligo- 
mers and flanking sequence, a modified version of the 
algorithm reduced the number to 184 (Additional file 3). 
This library is about 45% of the minimum size obtainable 
with previous algorithms that give uniform coverage. 

Multiplexed oligomer screen identifies functional units 
driving tissue-specific expression 

Each of the 184 multiplexed 15 -bp oligomers was synthe- 
sized and cloned upstream of the 42-bp adenovirus type 5 
Elb [26] minimal promoter and green fluorescent protein 
(GFP) in a Tol2 retrotransposon-based [27] zebrafish repor- 
ter vector [28]. Each construct was injected into single-cell 
zebrafish embryos, and GFP expression was monitored in 
15 tissues at 24 and 48 hpf (Figure lb). These time points 
coincide with the establishment of the basic vertebrate bau- 
plan and the development of key structures. In addition, 
embryos were injected with an empty vector control (iden- 
tical flanking sequence but no insert) to assess the contri- 
bution of the minimal promoter. A substantial portion of 
the MRCC library (70/184 constructs, approximately 38%, 
at 24 hpf) drove consistent (in >40% of embryos) expres- 
sion patterns, though our design was unguided by a priori 
knowledge and only a small amount of information could 
be encoded in each oligomer. A potential explanation for 
this is that our constructs, like regulatory regions marked 
selectively by histone 3 lysine 4 mono-methylation 
(H3K4mel), have a higher GC content than the zebrafish 
genome (average 43.1% for H3K4mel and 50% for our 
multiplexed oligomers, versus 36.4% genome-wide). An 
alternative explanation is that while the oligomers are very 
short, the design of our MRCC library encodes a maximum 
amount of sequence diversity. It is thus likely that the 
MRCC library contains more active sequences than ran- 
dom genomic sequence of the same size. Our results are 
consistent with a much smaller percentage of the genome 
being active in regulation. 

We observed a diverse range of expression at both time 
points (Figure Ic; Additional files 2 and 3). However, there 
was a clear bias towards amplifying the background bias of 
the minimal promoter (Figure Ic; Additional file 2). 
By removing expression patterns that were significantly 
correlated with background expression (Spearman rank 
correlation coefficient, false discovery rate (FDR) q < 0.05), 
we identified 22 multiplexed constructs that drove GFP 
expression consistently at 24 hpf to at least one tissue 
(Figure Id; Additional file 3). An additional five constructs 
produced uncorrected and consistent expression solely at 
the 48 hpf time point (Additional file 2). 



Tissue specificity is a common feature of regulatory ele- 
ments [29]. We thus examined the tissue distribution of 
constructs expressed consistently in one or more tissue. 
Eleven out of the 22 constructs (50%) that produced con- 
sistent and uncorrected expression at 24 hpf did so in a 
single tissue (Figure le). In contrast, several correlated 
constructs exhibited very broad expression patterns. For 
example, 1EF03 and 3EF09 produced the strongest 
observed signal in ten tissues across the two time points, 
likely due to a GC-rich sequence embedded at the 5' end 
of each oligomer (Figure If). 

We chose a subset of the consistently expressed, tissue- 
specific, uncorrected constructs to pinpoint functional 
sequences within the constructs and the zebrafish genome: 
2CD12/epidermis, 2GH08/brain, 2GH12/notochord, and 
3CD05/spinal cord (Figure 2). 2GH08/brain was unique in 
this set in that it expressed consistently in more than one 
tissue: bilaterally in rostral forebrain neurons at both time 
points and in ganglia of the otic lateral line at 24 hpf. For 
comparison, representative negative embryos injected with 
the empty vector control are depicted in Additional file 2. 
We confirmed that the expression pattern of these four 
constructs was due to genomic integration by producing 
Fl transgenics. In each case, Fl GFP expression resembled 
that of FO embryos (Additional file 2). To address the 
dependence of each sequence on the Elb minimal promo- 
ter, we swapped it with another TATA-box containing 31- 
bp element commonly used in luciferase reporter assays 
(see Materials and methods). Three of the four constructs 
had essentially identical patterns, suggesting that the 
observed tissue-specific activity is guided by the 15-bp 
sequence (Additional file 2). 2GH12/notochord was the 
outlier, suggesting that a sequence within the Elb minimal 
promoter could be needed for its tissue specificity. 

Multiplexed oligomers contain shorter, functional 
sequences 

A previously reported machine-learning algorithm 
assigned weights to 6-mers based on their association with 
putative forebrain enhancers [6], as defined by the binding 
of the ElA binding protein p300 (EP300) in mouse fore- 
brain at embryonic day 11.5. We used these forebrain 
enhancer weights to score each of the 184 multiplexed 
constructs based on the combined weight of all 6-mers it 
contains. Higher scores indicate constructs with sequence 
patterns that are positively associated with EP300 binding 
in embryonic mouse forebrain. Using a logistic regression 
model, we compared the scores of constructs that had 24 
hpf forebrain expression in the top 10% of our study with 
the scores of the remainder of the constructs. This analysis 
identified a significant association between our top fore- 
brain constructs and 6-mers identified by Lee et al (odds 
ratio = 1.17, P = 0.008; Wald test of logistic regression 
coefficient), despite differences in methods, organism, and 
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Figure 2 The four uncorrelated constructs chosen for follow-up experiments due to their consistent, specific expression patterns: 

2CD12/epidermis; 2GH08/brain (showing both forebrain and otic lateral line expression); 2GH12/notochord; 3CD05/spinal cord. White 

rectangles indicate regions tliat are magnified in the right side panels. 
J 



developmental stage between the two studies. We found 
an even stronger association using the top 5% of the fore- 
brain constructs (odds ratio = 1.23, P = 0.008; Wald test). 

For the 18 multiplexed oligomers with forebrain expres- 
sion in the top 10% of our study, an average of only 8 of 
14 distinct 6-mers in each construct have positive weights. 
This suggests that multiple 6-mers in each construct are 
dispensable for forebrain expression. To functionally 



explore this hypothesis in multiple tissues, we performed a 
functional dissection experiment on the four selected tis- 
sue-specific constructs. For each, we removed 6 bp to 
leave the left (L9), middle (M9), and right (R9) 9 bp of 
each 15-bp insert (Figure 3a). In all but one case (2GH12/ 
notochord), a single 9-bp sequence was sufficient to lar- 
gely reproduce the expression pattern of the original mul- 
tiplexed construct (Figure 3b; Additional file 4). 
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We used Tomtom [30] to identify JASPAR [31] and 
UniPROBE [32] position weight matrices (PWMs) that 
were similar to the functional 9-bp (or 15-bp, in the case 
of 2GH12) sequences identified through the functional 
dissection experiments. Although TF binding specificities 
are thought to be conserved, the PWMs in these data- 
bases are not derived from zebrafish TF binding specifici- 
ties, and thus are not ideal. However, in each case, the 
top hit (Figure 3b) suggested TFs or TF families that 
plausibly interact with these sequences. The M9 region 
of 2CD12/epidermis contains a GATA motif on the 
reverse strand, matching the UniPROBE PWM for GataS 
{P = 0.001, Tomtom empirical P- value [30]) but likely 
corresponding to Gata3 in zebrafish, which binds to a 
very similar motif in vitro (Additional file 2) and is a reg- 
ulator of epidermal cell lineage [33]. Likewise, the L9 
region of 3CD05/spinal cord matches the UniPROBE 
PWM for HoxalO {P = 0.0005), although this motif likely 
is recognized by a more medially expressed hox in zebra- 
fish, such as Hoxa9a. Mammalian HoxalO and Hoxa9 
binding motifs are extremely similar, and hoxa9a expres- 
sion precisely mirrors the activity of 3CD05/spinal cord 
(Additional file 2). The L9 region of 2GH08/brain 
strongly matches the JASPAR PWM for Esrrb {P = 
0.001), the zebrafish homolog of which is expressed in 
forebrain and hindbrain nuclei, recapitulating the expres- 
sion pattern driven by the 15-bp multiplexed oligomer 
(Additional file 2). 2GH12/notochord strongly matches 
the UniPROBE PWM for Foxa2 {P = 0.0009), which is 
expressed in the zebrafish notochord at the Prim-5 stage 
(approximately 24 hpf) [34]. The longer length of this 
motif provides a clue as to why 2GH12 could not be 
functionally dissected. In summary, the functional dissec- 
tion experiments suggest the expression patterns of the 
four tissue-specific multiplexed oligomers are largely due 
to the action of shorter, 6- to 9-bp sequences. 

Design of synthetic regulatory elements from smaller 
functional units 

Regulatory sequences often contain clusters of homotypic 
TFBSs, with an average of five per cluster [35]. To deter- 
mine whether this principle could aid in the design of syn- 
thetic regulatory elements, we generated concatemers of 
each of the four tissue-specific sequences, with five copies 
of each 15-bp insert, and tested them in zebrafish. 
5x2CD12/epidermis, 5x2GH12/notochord and 5x3CD05/ 
spinal cord produced more intense and specific expression 
patterns than their single-copy counterparts (Figure 4a; 
Additional file 2). In contrast, concatenation of 2GH08/ 
brain abolished specific expression in the forebrain and 
otic lateral line entirely (Figure 4a). Combined with the 
functional dissection result, these data suggested that the 
Xhol-flanking sequence is necessary for proper function of 
the 2GH08 sequence. In a second round of concatemer 



design, we concatenated repeats of the functional 9-bp 
sequence from the functional dissection experiments (L9) 
along with the 6-bp Xhol flanking region. The construct 
XhoI-2GH08(L9) produced consistent expression (67% of 
embryos expressing) in rostral forebrain and otic lateral 
line in a more intense and specific pattern than 2GH08/ 
brain (Figure 4b; Additional file 2). These results demon- 
strate that the strength of a regulatory element can be 
tuned by simple repeats of short, key regulatory sequences 
in vivo. 

Large, distal-acting enhancers (average length 1,529 
bp) have been shown to autonomously drive diverse 
expression [36]. However, the internal circuitry of indi- 
vidual enhancers is much more complex, requiring the 
precise spacing and ordering of regulatory sequences 
[37-40]. To assess the syntactic rules that apply at the 
scale of our synthetic regulatory elements, we assayed 
tandem combinations of two or three concatemers. A 
tandem combination of brain (5xXhoI-2GH08(L9)) and 
skin (5x2CD12) concatemers abolished expression in 
skin and forebrain but not the otic lateral line. A tan- 
dem combination of brain and notochord (5x2GH12) 
concatemers, as well as a tandem brain, skin and noto- 
chord triplet abolished regulatory element activity in 
those tissues (Figure 5a,c). In contrast, injecting pairs of 
concatemers as separate constructs yielded consistent, 
combinatorial expression of the two patterns (Figure 5b, 
c). In each tandem combination, the XhoI-2GH08(L9)/ 
brain concatemer was the most distant from the mini- 
mal promoter, while the intervening sequence and the 
distance to the minimal promoter varied. The fact that 
two intervening sequences abolished the activity of the 
brain concatemer (brain+notochord and brain+skin 
+notochord), but another (brain+skin) only limited its 
spatial extent, suggests the local context of a regulatory 
element is essential to its function. These data are con- 
sistent with the idea that large, autonomous enhancers 
contain insulator elements and/or signals mediating 
long-range interactions [37] that are absent in our syn- 
thetic regulatory elements. 

Short, tissue-specific oligomers are essential components 
of native regulatory elements 

To explore the native function of the multiplexed 
sequences in the zebrafish genome, we identified construct 
alignments in H3K4mel -positive, H3K4me3- negative pre- 
dicted-enhancer regions [12] and then employed GREAT 
[41] to identify significantly enriched gene ontology terms 
for each construct. For each of the four 15-bp constructs, 
we identified one or more significantly enriched biological 
processes that were relevant to the tissue expression pat- 
tern (binomial test, FDR q < 0.05; Figure 6a; Additional 
file 5). We further aligned 9-bp sequences identified in the 
functional dissection experiments and performed the same 
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Figure 3 Functional dissection of multiplexed sequences reveals key motifs (a) We tested three deletion mutants for each of the four 
tissue-specific multiplexed sequences, using the first (L9, peach), middle (M9, green) and last (R9, blue) 9 bp of each sequence cloned between 
the same flanking sequences, (b) With one exception (2GH12/notochord), a 9-bp sequence was sufficient to drive the expression pattern of the 
original multiplexed construct. Percentages offish expressing in the appropriate tissue at 24 hpf are shown, with the value for the functional 
9-mer highlighted in black. The top-matching position weight-matrix identified by Tomtom [30] is listed for each construct. 



analysis in GREAT, identifying several additional enriched To determine if these sequences contribute to zebra- 
terms that matched the corresponding expression pattern fish regulatory modules, we picked ten 2GH12/noto- 
of the functional 9-bp sequence (Figure 6b; Additional chord and ten 2GH08(L9)/brain alignments that overlap 
file 6). 24-hpf whole-embryo H3K4mel+/me3- ChlP-seq peaks. 



Smith et al. Genome Biology 2013, 14:R72 
http://genomebiology.eom/content/14/7/R72 



Page 8 of 1 5 



Xhol 1 


15mer 


15nner 


15mer 


15mer 


15mer 


Bgi1l| 

















5X2CD12 
82/105 embryos 



5x2GH12 
76/95 embryos 




5X3CD05 
28/31 embryos, 



5X2GH08 
49 embryos 



J Xhol 2GH08 Xhol 2GH08 ^hol 
L9 L9 


2GH08 Xhol 2GH08 ^hoi 2GH08 ,„] 
L9 L9 L9 J 








5xXhol-2GH08(L9) 




37/55 embryos 



Figure 4 Design of synthetic enhancers from short multiplexed sequences, (a) We generated concatemers containing five copies of eacli of tine 
six multiplexed sequences. Three of these (5x2CD12/epidermis, 5x2GH12/notochord, 5x3CD05/spinal cord) drove more intense expression in the same 
tissue as the original 15-bp sequence. The fourth construct, 5x2GH08/brain, did not produce any detectable expression, suggesting that the Xhol- 
flanking sequence may be important. White arrows mark the epidermis, notochord, and spinal cord, respectively, (b) A second version of the 2GH08/ 
brain concatemer contained five tandem repeats of the Xhol flanking region with the 2GH08(L9) functional sequence that was identified by the 9-bp 
dissection experiments. This arrangement produced a strong regulatory element that drives consistent expression to the forebrain and otic lateral line 
at 24 and 48 hpf. White arrows mark the forebrain and otic lateral line patterns. All pictures are of 24 hpf embryos. 
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Figure 5 Combination of synthetic regulatory elements underscores the importance of context, (a) Left: a tandem combination of the 
5xXhol-2GH08(L9)/brain and 5x2GH12/notochord concatemers failed to produce any enhancer activity above background. Right: a tandem 
combination of the 5xXhol-2GH08(L9)/brain and 5x2CD12/epidermis concatemers resulted in consistent otic lateral line expression. Expression in 
the forebrain and epidermis was detected inconsistently in a handful of embryos (such as the one shown), but not as intensely as the original 
concatemer. (b) Injection of a mixture of the 5xXhol-2GH08(L9)/brain and 5x2GH12/notochord (left) or 5xXhol-2GH08(L9)/brain and 
5x2CD12/epidermis constructs (right) resulted in the expected combinatorial pattern (white arrows), (c) Percentage of embryos expressing in the 
epidermis, forebrain and notochord for the 5x concatemers individually, as combinations or as a mixture of two constructs. Tandem 
arrangement of concatemers in the same regulatory element largely failed to reproduce the expression patterns driven by concatemers 
individually or in a mixture, with the exception of the brain-epidermis combination, which preserved otic lateral line expression. 
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Figure 6 Short sequences contribute to endogenous developmental zebrafish enhancers. (a,b) Alignments of 15-bp multiplexed 
sequences and 9-bp sequences identified by functional dissection were intersected with predicted enhancers (defined by H3K4mel -positive, 
H3K4me3-negative ChlP-seq regions [12]). We then used GREAT [41] to find enriched ontology terms for nearby genes. In each case, a term 
consistent with the expression pattern was identified, suggesting that the sequences identified by our screen are performing specific roles in 
developing vertebrates. Plotted are binomial fold enrichment values for the specific 15-bp or 9-bp alignments. Functional 9-bp sequences are 
depicted by the letter F. To control for common developmental gene ontology terms, we include the binomial fold enrichment values for 
the H3K4mel+/me3- alignments of all 184 constructs (a) or the non-functional 9-bp sequences identified by functional dissection (b). We 
tested enhancer activity of 20 1,000-bp regions from the zebrafish genome that overlap with H3K4mel/me3- signal and contain an 
alignment for 2GH12/notochord or the 2GH08(L9)/brain sequence, (c-e) Three often 2GH1 2/notochord-containing regions (Ncl-3) showed 
strong notochord (Nc) enhancer expression at 24 hpf. (f-j) Five often 2GH08(L9)/brain-containing sequences (Fbl-5) were strong forebrain 
(Fb) enhancers at 24 or 48 hpf. The chromosomal coordinates of each enhancer are indicated (Zv9 zebrafish genome assembly) as well as 
the time point with the strongest expression, (kj) Deletion of the 2GH1 2/notochord or 2GH08(L9)/brain sequence from the 1,000-bp 
enhancers leads to a reduction and, in several cases, abolition of the observed enhancer signal. Bars represent the percentages of fish 
exhibiting expression in the notochord (k) and forebrain (I) at the indicated time point provided in the pictures above. The red dotted line 
represents the 40% consistency cut-off used for this study. 



We cloned approximately 1,000-bp fragments, including 
the alignment and overlapping H3K4mel signal, and 
tested them for enhancer activity in zebrafish. Three 
2GH12- and five 2GH08(L9) -containing fragments drove 
consistent (>40% of fish expressing) expression to noto- 
chord and forebrain, respectively (Figure 6c-j; Additional 
file 7). A random selection of ten H3K4mel+/me3- 
regions identified two strong forebrain enhancers and two 
weak enhancers, driving expression consistently yet non- 
specifically to the notochord and forebrain, respectively 
(Additional files 2 and 7). Together, these results indicate 
that these multiplexed sequences could be used in con- 
junction with other genome-wide datasets to improve the 
identification of functional, tissue-specific enhancers. 

To determine the relative strength with which our syn- 
thetic elements can drive expression, we performed quanti- 
tative PGR on pools of 25 GFP-positive embryos injected 
with the original 15-bp insert, the 5x concatemer, or a 
strong native enhancer (Nc2/Fb2) corresponding to 
2GH08/brain or 2GH12/notochord (Additional file 2). In 
each case, the native enhancer produced higher GFP 
expression than both the 15-bp insert (3.0- and 10.4-fold 
for brain and notochord, respectively) and the 5x concate- 
mer (2.5- to 5.0-fold, respectively), suggesting that synthetic 



15-mers and 5x concatemers are not as strong as native 
enhancers. 

We next performed deletion mutagenesis on each posi- 
tive 1,000-bp enhancer, removing the 2GH12/notochord- 
and 2GH08(L9)/brain-aligned regions to determine if they 
are required to drive the specific expression patterns 
observed. The deletions reduced the specific expression of 
the enhancer in seven out of eight cases and abolished it 
for one of three positive 2GH12- and two of five positive 
2GH08(L9) -containing enhancers (Figure 6k,l; Additional 
file 7). These results demonstrate that the multiplexed reg- 
ulatory sequences identified by our screen are necessary 
for the proper spatiotemporal expression of specific 
enhancers in the developing zebrafish. 

Discussion 

In this study, we made several observations on the role of 
TFBS-sized sequences in gene regulation in vivo. First, we 
identified a considerable number of very short DNA 
sequences that are drivers of consistent, tissue-specific 
expression. A previous report carried out in various cell 
lines [20] suggested, as a rationale for screening concate- 
meric sequences, that such short sequences could not 
drive observable levels of transcription. Our results suggest 
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that this is not the case, and that there exists an intermedi- 
ate stage between neutral sequence and fully-fledged regu- 
latory elements. We also showed that expression levels 
can be amplified in vivo through homotypic concatenation, 
a principle that has previously been demonstrated primar- 
ily in in vitro transient transfections [38,42,43]. Together, 
these findings suggest a potential mechanism through 
which short sequences could gain a toehold as minimal 
regulatory elements and evolve into more complex 
clusters. 

A limitation of this approach is that each multiplexed 
construct contains multiple /c-mers, requiring further 
bashing to determine functional sequences. However, 
such functional dissection is negligible compared to what 
would be required for a typical enhancer of more than 
1,000 bp. Furthermore, multiplexing 6-mers into 15-bp 
oligomers identified sequences such as 2GH12, which 
was the only construct to drive notochord-specific 
expression and could not be broken down into smaller, 
functional units. This result, in particular, points to the 
potential benefits of testing higher order /c-mers, ideally 
without multiplexing. 

We additionally demonstrate that short /c-mers (9 to 15 
bp) are important for the proper spatiotemporal expres- 
sion of a subset of distal-acting regulatory elements. Thus, 
even though our multiplexed sequences were generated 
synthetically, they harbor endogenous signals that confer 
tissue-specificity. Using a straightforward TFBS analysis 
tool and publicly available data, we identified potential 
interacting proteins for all four tissue-specific signals. 
However, these matches were not perfect, highlighting the 
need for more in vivo testing of short sequences. 

Another contribution of this work is a mathematical and 
computational framework for the design of ultra-compact 
DNA libraries with single-copy representation of pairs of 
reverse-complementary /c-mers. We show here that the 
reverse complementarity of DNA can be exploited to con- 
struct an MRCC library, which is about half the size of the 
analogous de Bruijn sequence library and yet still maintains 
uniform, minimal coverage of /c-mers. Traditional de Bruijn 
sequences have proven to be extremely useful for biological 
applications, including protein-binding microarrays [3], 
universal DNA tag systems [21] and microfluidic affinity 
analysis devices [22]. Our formal analysis of the structure 
of the de Bruijn graph induced by the reverse complemen- 
tarity of DNA should aid in the advancement of these and 
other specialized techniques that use /c-mer libraries 
[20,44]. In addition, the use of programmable microarrays 
to generate barcoded libraries of sequences for functional 
testing would provide an enticing platform for applying 
MRCC libraries. These 'massively parallel reporter assays' 
(MPRAs) have so far only been used to test variants of an 
individual promoter or enhancer [45-47], but could be 
used to screen /c-mer libraries on a larger scale. 



Extending such methodologies to higher k may aid in 
the discovery of drivers for tissues such as the eye, otic 
vesicle, and fins, which were poorly represented in the 
6-mer screen. For these tissues, it is entirely possible 
that 6-mers are simply not sufficient to drive expression. 
Although our screen covered 100% of 6-mers, it only 
captures 32% of all 7-mers, 9% of all 8-mers, and a 
mere 2% of the 262,144 unique 9-mers. Such methods 
could be combined with transgenic lines with secondary 
markers for specific tissues of interest to improve 
annotations. 

Finally, as evidenced by a homotypic concatemer 
whose expression is affected by downstream sequences, 
we show that local context affects enhancer function. 
Our findings are consistent with previous reports that 
show that the function of regulatory elements is sensi- 
tive to changes of spacing and order [37-40], but could 
also imply that our synthetic enhancers are missing key, 
yet undiscovered, signals typical of endogenous enhan- 
cers. While deciphering the rules of syntax by testing 
individual combinations is technically chaUenging, this 
process could also be accelerated by testing rearrange- 
ments of a known enhancer in parallel, using MPRAs. 
One such example is the recent use of MPRA technol- 
ogy to characterize the regulatory logic of promoters in 
yeast [43]. Such large-scale functional dissections wUl 
aid in the design of synthetic regulatory elements that 
can drive molecules to a specified level of expression in 
any combination of tissues and time points. 

Conclusions 

Our unique catalog of short, functional sequences is a 
good starting point in the search for the basic functional 
units contributing to regulatory element activity. These 
sequences could be used to improve genome annotation 
and to interpret the effects of non-coding genetic var- 
iants. Another application is in the design of synthetic 
regulatory elements that can drive expression of mole- 
cules to specific tissues and time points. Using only a 
small subset of the words' in our 6-mer regulatory cata- 
log, we demonstrate a simple type of 'sentence' that 
obeys the syntactical rule of amplification through repe- 
tition. Attempts to construct more elaborate sentences 
revealed a non-modular grammar at the scale of TFBS- 
sized elements, highlighting the need for functional 
assays with higher throughput. 

Materials and methods 

Ultra-compact 6-mer library design 

Although we chose k = 6, our methods are applicable in 
principle to larger values of k and libraries of varying 
size. We restricted the design to include just one /c-mer 
out of each reverse-complementary pair of /c-mers (VcRC- 
pair ). For even values of /c, a /cRC-pair may consist of a 
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single self-reverse-complementary palindrome, rather than 
two sequences. For an RC-pair X and collection S of m 
DNA sequences, x{X,S) denotes the total number of times 
that the /c-mers in X appear as substrings in all sequences 
ofS, For example, if X = {ATTTTA, TAAAAT} and S = 
{GATTTTATTTTAGGG, TAAAATGGATTTTAA}, 
then x{X,S) = 4. We introduced the concept of a MRCC 
library of order k and size m, which, formally, is a collec- 
tion S of m DNA sequences such that for every /cRC-pair 
X, XiX^S) = L 

The /c-dimensional de Bruijn graph (Figure la; Addi- 
tional file 2) has n/^ vertices, each labeled by a unique 
/c-mer, and W/^+idirected edges, each labeled by a unique 
(/:+l)-mer. The edge labeled by x is directed from the ver- 
tex labeled by the first k bases of x to the vertex labeled by 
the last k bases of x. Every path, i.e., sequence of edges, 
corresponds to a DNA sequence: the label of the first edge 
gives the first k+1 bases of the sequence, and for / > 2, the 
final base in the label of the i-th edge gives the (/:+/) -th 
base in the sequence. The reverse complement of a path t 
is also a path: it consists of the reverse complements of 
the edges in t, traversed in reverse order. The (/:-l)-dimen- 
sional de Bruijn graph is used to efficiently generate a de 
Bruijn sequence of order k via the computation of an 
Eulerian cycle, that is, a single cyclical path that traverses 
every edge in the graph exactly once (Additional file 2). 
There is an equivalent path that traverses every node 
exactly once in the more visually intuitive /c-dimensional 
de Bruijn graph, described in Figure la. 

To create an MRCC library, we designed an algorithm 
(Additional file 1) that partitions the edges of the de 
Bruijn graph such that there are at most two palindromes 
per path (Additional file 2). Our algorithm has two main 
stages. Stage 1, which is only executed for even values of 
k {k > 2), iterates over pairs of palindromes, removing 
from the graph two reverse-complementary paths 
between each pair. The paths correspond to reverse-com- 
plementary sequences, one of which is added to the 
library. Stage 2 partitions the remaining edges into two 
reverse-complementary cycles. The sequence that corre- 
sponds to one of those cycles is then divided into the 
remaining number needed of consecutive oligomers such 
that the final k-l bases of each sequence overlap the first 
k'l bases of the next (Additional file 2). The overlap is 
required to maintain /cRC-pair coverage. These oligomers 
complete the library. 

The algorithm is defined for any integers k and m such 
that k >0 and Vip/^ <m <{V2nk - V^kpk) (for k = 2, the range 
is 0 <m <V2nj^), For k = 6, m must be between 32 and 
1,952, which meets the constraints of our construct 
design, which required m to be a feasible number for in 
vivo testing. 

Stage 2 of our algorithm, which is an adaptation of the 
classic Eulerian-cycle algorithm by Hierholzer [48], is 



similar to a heuristic algorithm sketched by Mintseris 
and Eisen [24] that attempts to partition all edges of the 
(/c-l) -dimensional de Bruijn graph into just two reverse- 
complementary cycles. For even values of k larger than 
2, it is not possible to partition the edges this way; every 
MRCC library has size m > 1 (Additional files 1 and 2). 
For odd values of k and k = 2, this edge partitioning 
does exist (Additional file 2); one of the two cycles is 
equivalent to an MRCC sequence. 

If the initial bases of an oligomer insert match the 
final bases of the left flanking sequence, then those 
bases can be eliminated from the oligomer to reduce its 
length without sacrificing its /cRC-pair coverage. The 
same is true for the final bases of the oligomer and the 
right flanking sequence. We created a modified version 
of the algorithm that uses heuristics during cycle-finding 
and sequence-splitting to produce oligomers with these 
characteristics. This technique further reduced library 
size and the number of /c-mers repeated across the junc- 
tions between oligomers and flanking sequences. 

Software implementation 

We implemented the original algorithm and used it to 
produce an initial construct design for k = 6 that contains 
m = 208 oligomers, each of length 15 bp. The number N/^ 
of /cRC-pairs is Nj^ = pj^+ {V2){nj^ - pj^ = {V^jin/^ + p/^), where 
= 4^ is the number of /c-mers and p/^ = 4^^^^ is the num- 
ber of palindromes. For k = 6,N/^= 2,080, ri/^ = 4,096, and 
P/^ = 64, which implies that the initial design is optimally 
compact, excluding flanking sequences. With our imple- 
mentation of the flanking-sequence-based, modified 
version of the algorithm, we reduced the number of oligo- 
mers by more than 10%, from 208 to 184, still each 15 bp. 

Due to Perl's facility in handling sequences, all imple- 
mentations are in Perl. They can be adapted for better 
time and space efficiency, if needed for larger values of 
/c, using a compiled programming language, such as C, 
and specific data structures (Additional file 1). All code 
is available from [25]. 

Plasmids and constructs 

For the primary screen and deletion series, each oligomer 
(Additional file 3) and its reverse complement were 
synthesized (Bioneer Inc., Daejeon, Republic of Korea) 
with 5' and 3' overhangs, annealed and then cloned 
between Xhol and Bglll restriction sites of the zebrafish 
enhancer vector Elb-Tol2 [28]. For the endogenous 
enhancer screen, 1,000-bp regions were amplified from 
zebrafish genomic DNA using CACC flanked primers, and 
then cloned into pENTR using the TOPO Cloning kit 
(Life Technologies, Carlsbad, CA, USA). Cloned enhancers 
were subsequently transferred to the Elb-Tol2 vector 
using the LR Clonase II cloning enzyme (Life Technolo- 
gies). For the concatemeric constructs, 150- to 225-bp 
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regions were synthesized and cloned into Elb-Tol2 (Bio- 
matil< USA, LLC, Wilmington, DE, USA). 

Zebrafish injections 

Constructs were injected with Tol2 by standard techni- 
ques [27,49]. An average of 60 embryos were injected for 
the primary screen of 184 oligomers, with the secondary 
requirement that at least 25 embryos survived to 48 hpf. 
The expression patterns of the four tissue-specific con- 
structs chosen for follow-up experiments were monitored 
in at least 100 embryos at each time point. For the alterna- 
tive minimal promoter screen, we replaced the Elb mini- 
mal promoter with the 31 -bp minimal promoter region 
from the pGL4.23 luciferase reporter vector (Promega, 
Madison, WI, USA). 

GFP expression was recorded in 13 tissues at 24 hpf and 
15 tissues at 48 hpf according to established ZFIN nomen- 
clature [50]. At least two images (one lateral, one dorsal or 
ventral) were acquired for each construct, with more 
images taken for constructs showing consistent expression 
patterns. A complete collection of screen data and a col- 
lection of image data for consistently expressing constructs 
are available from our website [51]. 

Correlation analysis 

For each multiplexed construct, fractional expression 
values were determined by dividing the number of 
embryos expressing in a given tissue by the total number 
surviving at each time point. The same values were 
obtained for an "empty vector" control construct lacking 
an oligomer insert. Correlation between these two profiles 
was then determined for each construct and time point 
using the Spearman's rank correlation coefficient. Con- 
structs with an FDR-adjusted P-value <0.05 were consid- 
ered correlated to the empty vector. 

Support-vector machine forebrain enhancer weights 

Using the 6-mer weights assigned by a previously reported 
support vector machine (SVM) that predicts EP300- 
binding in forebrain [6], we computed a score for each 
construct by summing the weights of the 6-mers in the 
construct (15-bp oligomer insert with 2-bp flanking 
sequence on each end). We binarized our expression data, 
assigning 1 to a construct if its forebrain expression value 
was in the top 10% (or 5%) of all constructs at 24 hpf, and 
0 otherwise. Those data were used to fit a logistic regres- 
sion model using R [52], with construct score as the 
predictor variable. We tested for statistical significance of 
the association between the score and the probability of 
forebrain expression in the top 10% (5%) using a Wald test. 

Transcription factor binding site analysis 

Nine or 15-bp (in the case of 2GH12) functional sequences 
were compared to known vertebrate JASPAR [31] and 



UniPROBE [32] PWMs using Sandelin-Wasserman simi- 
larity [53] in Tomtom [30]. The most statistically signifi- 
cant PWM match to each construct was identified. 

Alignments to the zebrafish genome 

Fifteen-base-pair inserts were aligned to the zebrafish 
genome (build Zv9) using bowtie vO. 12.730 [54], allowing 
three mismatches and masking repeats (UCSC Genome 
Browser [55] RepeatMasker track). For the shorter, func- 
tional sequences identified by the functional dissection 
experiments, we aligned 11-mers (9 bp + 2 bp of flanking 
sequence), with one mismatch allowed. 

Enrichment analyses 

GC content was computed from the Zv9 assembly, with 
repeats masked. Enrichment analyses were performed 
using Fisher's exact test. P-values were Bonferroni 
adjusted. 

We predicted enhancer regions using previously 
reported histone methylation data [12] available from the 
UCSC Genome Browser [55]. We used the UCSC table 
browser to intersect 15- or 11 -bp alignments with pre- 
dicted enhancers, and fed the results into GREAT [41] 
using the default settings, with an FDR-adjusted binomial 
P- value threshold of 0.05. 

Quantitative PGR 

Twenty-five GFP expressing embryos were pooled for each 
injected construct (as well as an uninjected control pool) 
and dounce homogenized in TRIzol reagent (Life 
Technologies). Total RNA was extracted according to the 
manufacturer's instructions, and subsequently cleaned up 
using the RNeasy mini kit (Qiagen, Hilden, Germany). 
To reduce genomic DNA, we performed an on column 
digestion with the RNase-free DNase set (Qiagen). Total 
RNA (1.5 (ig) was processed into cDNA using the 
High-Capacity cDNA Reverse Transcription Kit (Applied 
Biosystems, Foster City, CA, USA). For each reaction, we 
performed a control without the reverse transcriptase 
enzyme (no RT) to quantify residual genomic GFP. We 
performed Taqman assays for GFP or beta-actin 1 {actbl) 
using the TaqMan Universal PCR Master Mix, No 
AmpErase UNG reagent (Applied Biosystems), in tripli- 
cate. GFP expression was first normalized to the no RT 
conditions to single out GFP RNA, then to actbl as a 
loading control. 

Additional material 



Additional file 1: Supplemental note. Two results that justify and give 
a mathematical proof of correctness of our formal algorithm (in box), 
which we used to construct the oligomer library. 

Additional file 2: Supplemental figures. Seven supplemental figures 
and legends. 
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Additional file 3: Supplemental Table 1. Multiplexed sequence screen 
data for 184 constructs tested using zebrafish transgenic assays. 

Additional file 4: Supplemental Table 2. Embryo counts from the 
functional dissection experiments. 

Additional file 5: Supplemental Table 3. Enriched gene ontology 
terms for 15-bp alignments of the four tissue-specific multiplexed 
oligomers. 

Additional file 6: Supplemental Table 4. Enriched gene ontology 
terms for 9-bp (+2 flanking bp) alignments of the three tissue-specific 
multiplexed oligomers. 

Additional file 7: Supplemental Table 5. Genomic position and 
embryo counts for the thirty approximately 1,000-bp regions tested for 
enhancer activity. 
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